-
-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for clipping reads that extend past their mate #761
Conversation
// Adjust the insert size based on clipping on the 5' end of the given read. | ||
// This *does not* consider any clipping on the 5' end of its mate. | ||
val fivePrimeClipping = cigar.takeWhile(_.operator.isClipping).filter(_.operator == CigarOperator.S).sumBy(_.length) | ||
val adjustedInsertSize = abs(rec.insertSize) + fivePrimeClipping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tl;dr: using the insert size does not account for inserted bases!
@@ -532,6 +532,20 @@ class VanillaUmiConsensusCallerTest extends UnitSpec with OptionValues { | |||
s2.cigar.toString shouldBe "46S96M" | |||
} | |||
|
|||
it should "not to trim based on insert size in the presence of soft-clipping (+/-)" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just swapping R1 and R2 from the previous test
@@ -560,4 +574,18 @@ class VanillaUmiConsensusCallerTest extends UnitSpec with OptionValues { | |||
source.cigar.toString() shouldBe "3M" | |||
source.sam shouldBe Some(rec) | |||
} | |||
|
|||
it should "not trim reads with an insert size shorter than the read length when there's an insertion in the middle" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the new test to confirm the new behavior. Previously, the insert size would be calculated as 80bp, so the consensus read would have been 80bp. This happened because the insert size doesn't count inserted bases in the insert. The correct length of the consensus bases is 100bp.
// 3' end. If not, then we can use the full read. This *does not* consider any clipping on the 5' end of its mate. | ||
val adjustedReadLength = { | ||
if (rec.positiveStrand) { | ||
val mateEnd = rec.mateEnd.getOrElse(rec.start + abs(rec.insertSize) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If MC isn't present, this relies on the aligner setting insert size correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use CoordMath.getEnd
here?
Codecov Report
@@ Coverage Diff @@
## master #761 +/- ##
==========================================
- Coverage 95.47% 95.45% -0.03%
==========================================
Files 121 121
Lines 6855 6907 +52
Branches 463 479 +16
==========================================
+ Hits 6545 6593 +48
- Misses 310 314 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might need to be walked through this PR. I understand the bug/goal, but have questions about the implementation.
I don't understand why the else
branches are the way they are. It seems to be that they will now always remove 3' soft-clipping and I don't think that's desirable. I think there are three conceptual cases we need to walk through and make sure we agree on the expected outcome:
- A read in an FR pair is aligned such that it has aligned bases that extend past the "far" end of it's mate. This is simple, we know we want to trim any bases past the mate's far end.
- A read in an FR pair is aligned such that it doesn't have aligned bases that extend past the "far" end of it's mate, but is close and has soft-clipping that if converted to an
M
cigar, would put it past it's mate end. I'm not sure what to do in this case - the old code would have trimmed, but I'm not entirely sure that's the right thing to do. - Reads are far apart (e.g. 2x150 with 350bp insert size) but have soft-clipping at the 3' end. In this case I believe we do not want to remove the clipping.
// 3' end. If not, then we can use the full read. This *does not* consider any clipping on the 5' end of its mate. | ||
val adjustedReadLength = { | ||
if (rec.positiveStrand) { | ||
val mateEnd = rec.mateEnd.getOrElse(rec.start + abs(rec.insertSize) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use CoordMath.getEnd
here?
rec.readPosAtRefPos(pos=mateEnd, returnLastBaseIfDeleted=false) | ||
} else { | ||
// Find where the end of the read goes past the mate start, if at all | ||
val refPosAtReadEnd = min(mateEnd, rec.end) // last 3' base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? We're in an else under rec.end >= mateEnd
therefore rec.end will always be <= mateEnd and therefore the value returned. Am I missing something?
} else { | ||
// Find where the end of the read goes past the mate start, if at all | ||
val refPosAtReadEnd = min(mateEnd, rec.end) // last 3' base | ||
rec.readPosAtRefPos(pos=refPosAtReadEnd, returnLastBaseIfDeleted=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a while since I've been through this code, but this will remove any trailing clipping, insertions, and maybe deletions too. Is that intentional? I.e. if the cigar is 80M20S
, say because the read runs off into a large insertion, this will return 80
, and cause the read to get trimmed to 80 bases, no?
rec.readPosAtRefPos(pos=refPosAtReadEnd, returnLastBaseIfDeleted=false) | ||
} | ||
} else { | ||
// Find where the start of the read that goes before the mate start, if at all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Find where the start of the read that goes before the mate start, if at all | |
// Find where the start of the read goes before the mate start, if at all |
Just saying out loud that I find these comments hard to grok because they use start/end as then are in the SAM file, as opposed to e.g. if you're looking at reads in IGV (where directionality/strand is conveyed) I think we tend to flip start/end for reverse reads to match sequencing order.
If the reads are longer than the insert size, but aligned with insertions, the consensus read may have the insertion trimmed.
Adds methods to SamRecordClipper to clip bases that map beyond the end of the mate for FR pairs. Use these methods in ClipBam to enable the option `--clip-bases-past-mate`. Use these method in consensus calling to properly decide the number of bases to clip based on insert size for FR pairs.
388ba43
to
13d32a8
Compare
@@ -187,7 +188,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean) | |||
} | |||
|
|||
/** | |||
* Attempts to clip an additional numberOfBasesToClip from the 5' end of the read. For | |||
* Attempts to clip an additional numberOfBasesToClip from the 3' end of the read. For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc fix, unrelated but found while reading the code
@@ -233,7 +234,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean) | |||
/** | |||
* Ensures that there are at least clipLength bases clipped at the 5' end | |||
* of the read, _including_ any existing soft and hard clipping. Calculates any | |||
* additional clipping and delegates to [com.fulcrumgenomics.bam.SamRecordClipper.[clipStartOfAlignment]]. | |||
* additional clipping and delegates to [[com.fulcrumgenomics.bam.SamRecordClipper.clipStartOfAlignment]]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc fix
@@ -162,7 +163,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean) | |||
val oldElems = rec.cigar.elems.reverse | |||
val oldBases = rec.bases.reverse | |||
val oldQuals = rec.quals.reverse | |||
val (newElems, readBasesClipped, refBasesClipped, bases, quals) = clip(oldElems, numberOfBasesToClip, oldBases, oldQuals) | |||
val (newElems, readBasesClipped, _, bases, quals) = clip(oldElems, numberOfBasesToClip, oldBases, oldQuals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intellij warning
@@ -289,6 +290,96 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean) | |||
} | |||
} | |||
|
|||
/** Clips overlapping read pairs, where both ends of the read pair are mapped to the same chromosome, and in FR orientation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved from ClipBam
* | ||
* @param rec the read | ||
* @param mate the mate | ||
* @param doNotClip do not clip the reads, only return the number of bases that would be clipped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be convinced to update all the methods to have a mode where they don't actually clip, but just return how many additional bases they would clip.
@@ -440,7 +531,7 @@ class SamRecordClipper(val mode: ClippingMode, val autoClipAttributes: Boolean) | |||
unreachable(s"Cigar $es doesn't contain soft clipping at the start") | |||
} | |||
|
|||
elems = newElems.result | |||
elems = newElems.result() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scala warning
@@ -168,6 +170,9 @@ trait UmiConsensusCaller[ConsensusRead <: SimpleRead] { | |||
/** A consensus caller used to generate consensus UMI sequences */ | |||
private val consensusBuilder = new SimpleConsensusCaller() | |||
|
|||
/** Clipper utility used to _calculate_ clipping, but not do the actual clipping */ | |||
private val clipper = new SamRecordClipper(mode=ClippingMode.Soft, autoClipAttributes=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to have mode that is like "dry run", but wouldn't want that on the command line I don't think (e.g. in ClipBam
). Not sure how to prevent that.
min(adjustedInsertSize, trimToLength) - 1 | ||
// Get the number of mapped bases to clip that maps beyond the mate's end. Soft-clipped bases are not counted. | ||
val numBasesExtendingPasteMateEnd = this.clipper.clipExtendingPastMateEnd(rec=rec, doNotClip=true) | ||
if (numBasesExtendingPasteMateEnd == 0) trimToLength - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this matches the previous behavior. We don't trim if only soft-clipped bases are "beyond the end of the mate"
@@ -281,3 +281,23 @@ class AlignmentTest extends UnitSpec { | |||
alignment.subByTarget(6, 11).cigar.toString() shouldBe "5D1=" | |||
} | |||
} | |||
|
|||
class CigarStatsTest extends UnitSpec { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we didn't have tests for these, tsk tsk
@@ -524,4 +524,16 @@ class ClipBamTest extends UnitSpec with ErrorLogLevel with OptionValues { | |||
} | |||
} | |||
} | |||
|
|||
it should "clip FR reads that extend past the mate" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SamRecordClipper
has nice tests, so just wrote the minimal test here to ensure it works in ClipBam
.
@@ -211,7 +211,6 @@ class AnnotateBamWithUmisTest extends UnitSpec { | |||
Io.writeLines(reverseFq, Io.readLines(fq).grouped(4).toSeq.reverse.flatten) | |||
val annotator = new AnnotateBamWithUmis(input=sam, fastq=Seq(fq, reverseFq), readStructure=Seq(ReadStructure("2M4B+M"), ReadStructure("1B+M")), output=out, attribute=umiTag, sorted=true) | |||
val result = intercept[Exception] { annotator.execute() } | |||
println(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's one, set for stun
s2.cigar.toString shouldBe "46S96M" | ||
} | ||
|
||
it should "not trim based on insert size in the presence of soft-clipping (+/-)" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests mirrors the previous one, but swaps R1 and R2
@arg( doc="Clip overlapping reads.", mutex=Array("clipBasesPastMate")) val clipOverlappingReads: Boolean = false, | ||
@arg( doc="Clip reads who sequence past the start of their mate.", mutex=Array("clipOverlappingReads")) val clipBasesPastMate: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it's awkward to make these mutex. Yes, one implies the other, but they don't contradict each other. E.g. I could totally see a pipeline that wants to always set --clip-bases-past-mate
and optionally set --clip-overlapping-reads
, and now it'll have to know to unset the first option based on the second, when it's not strictly necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the mutex, since one is a subset of the other. If this was written from scratch, I'd prefer an enum that was "None", "PastMate", "Overlapping" for a single argument, but here we are. Perhaps we can deprecate clipOverlappingReads
and make an Option[PairClipping]
type, where PairClipping
is PastMate
or Overlapping
. Does that work for you?
src/test/scala/com/fulcrumgenomics/bam/SamRecordClipperTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/fulcrumgenomics/bam/SamRecordClipperTest.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/fulcrumgenomics/bam/SamRecordClipperTest.scala
Outdated
Show resolved
Hide resolved
rec.cigar.toString shouldBe "100M" | ||
mate.start shouldBe 1 | ||
mate.cigar.toString shouldBe "100M" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you want a test with insertions in the cigar strings since that was the original bug?
Also would be nice to have tests with asymmetry - i.e. one read does extend past it's mate but the other doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Copied bug test case from here to this file.
- See the test case "clip reads that where only one end extends their mate's start". Added a second one with insertions
Co-authored-by: Tim Fennell <[email protected]>
Co-authored-by: Tim Fennell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne I think this fixes up all the issues, except those that I left unresolved. I am not 100% in love with the way we trim back overlapping reads with deletions, since in some cases (see added tests) selecting a mid point in the overlapping reference window can be within the deletion, so after trimming neither read spans the deletion. For example, two fully overlapping reads that are 50M10D50M
, the mid point is right in the middle of the deletion, so we'll get 50M50S
for one read and 50S50M
for the other. I think it's more work than I am willing to do to try to keep the deletion in one of the two reads.
val numBasesToClip = if (totalClippedBases == 0) 0 else { | ||
// Bases past the end, so get the additional # of bases to clip | ||
val existingClippedBases = { | ||
val iter = if (rec.positiveStrand) rec.cigar.reverseIterator else rec.cigar.iterator | ||
iter.takeWhile(_.operator.isClipping).filter(_.operator == Op.S).sumBy(_.length) | ||
} | ||
Math.max(0, totalClippedBases - existingClippedBases) | ||
} | ||
this.clip3PrimeEndOfAlignment(rec, numBasesToClip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if the read is positive strand (the record isn't assumed to be positive!). Will condition on strand and use EndOfRead or StartOfRead depending.
Summary
This adds:
SamRecordClipper
to clip bases that map beyond the end of the mate for FR pairs.ClipBam
to implement the option--clip-bases-past-mate
.More details on the bug in the consensus caller that could improperly trim small inserts
If the reads are longer than the insert size, but aligned with
insertions, the consensus read may have the insertion trimmed.
See issue #759. Corrected output to include the full length read: